319 research outputs found

Straight to Shapes: Real-time Detection of Encoded Shapes

Author: Golodetz Stuart
Jetley Saumya
Sapienza Michael
Torr Philip H. S.
Publication date: 05/07/2017

Current object detection approaches predict bounding boxes, but these provide little instance-specific information beyond location, scale and aspect ratio. In this work, we propose to directly regress to objects' shapes in addition to their bounding boxes and categories. It is crucial to find an appropriate shape representation that is compact and decodable, and in which objects can be compared for higher-order concepts such as view similarity, pose variation and occlusion. To achieve this, we use a denoising convolutional auto-encoder to establish an embedding space, and place the decoder after a fast end-to-end network trained to regress directly to the encoded shape vectors. This yields what to the best of our knowledge is the first real-time shape prediction network, running at ~35 FPS on a high-end desktop. With higher-order shape reasoning well-integrated into the network pipeline, the network shows the useful practical quality of generalising to unseen categories similar to the ones in the training set, something that most existing approaches fail to handle.

Comment: 16 pages including appendix; Published at CVPR 201

arXiv.org e-Print Archive

Crossref

Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos

Author: Cuzzolin Fabio
Saha Suman
Sapienza Michael
Singh Gurkirt
Torr Philip H. S.
Publication date: 01/01/2016

In this work, we propose an approach to the spatiotemporal localisation (detection) and classification of multiple concurrent actions within temporally untrimmed videos. Our framework is composed of three stages. In stage 1, appearance and motion detection networks are employed to localise and score actions from colour images and optical flow. In stage 2, the appearance network detections are boosted by combining them with the motion detection scores, in proportion to their respective spatial overlap. In stage 3, sequences of detection boxes most likely to be associated with a single action instance, called action tubes, are constructed by solving two energy maximisation problems via dynamic programming. While in the first pass, action paths spanning the whole video are built by linking detection boxes over time using their class-specific scores and their spatial overlap, in the second pass, temporal trimming is performed by ensuring label consistency for all constituting detection boxes. We demonstrate the performance of our algorithm on the challenging UCF101, J-HMDB-21 and LIRIS-HARL datasets, achieving new state-of-the-art results across the board and significantly increasing detection speed at test time. We achieve a huge leap forward in action detection performance and report a 20% and 11% gain in mAP (mean average precision) on UCF-101 and J-HMDB-21 datasets respectively when compared to the state-of-the-art.

Comment: Accepted by British Machine Vision Conference 201
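The first dynamic-programming pass described in this abstract (linking boxes over time by class score and spatial overlap) can be illustrated with a small Viterbi-style sketch. This is not the authors' implementation: the function names `iou` and `link_action_path`, the additive energy (box score plus weighted IoU between consecutive boxes) and the `overlap_weight` parameter are assumptions made for illustration, and the sketch assumes every frame has at least one detection.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def link_action_path(boxes: List[List[Box]],
                     scores: List[List[float]],
                     overlap_weight: float = 1.0) -> List[int]:
    """Pick one box index per frame so that the sum of class scores plus
    weighted IoU between consecutive boxes is maximal (assumed energy)."""
    best = [list(scores[0])]        # best[t][j]: best energy of a path ending at box j
    back: List[List[int]] = []      # back[t-1][j]: predecessor of box j in frame t
    for t in range(1, len(boxes)):
        cur_best, cur_back = [], []
        for j, box_j in enumerate(boxes[t]):
            cands = [best[t - 1][i] + overlap_weight * iou(box_i, box_j)
                     for i, box_i in enumerate(boxes[t - 1])]
            i_star = max(range(len(cands)), key=cands.__getitem__)
            cur_best.append(cands[i_star] + scores[t][j])
            cur_back.append(i_star)
        best.append(cur_best)
        back.append(cur_back)
    # Backtrack from the best final box to recover the whole path.
    j = max(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for t in range(len(boxes) - 1, 0, -1):
        j = back[t - 1][j]
        path.append(j)
    path.reverse()
    return path
```

With a positive `overlap_weight`, a box with a slightly lower score can still be selected when it overlaps well with its neighbours in time; setting the weight to zero reduces the sketch to a per-frame maximum over scores.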

arXiv.org e-Print Archive

Crossref

Oxford University Research Archive

Oxford Brookes University: RADAR

InfiniTAM v3: A Framework for Large-Scale 3D Reconstruction with Loop Closure

Author: Cavallari Tommaso
Golodetz Stuart
Kähler Olaf
Murray David W
Prisacariu Victor Adrian
Sapienza Michael
Torr Philip H S
Publication date: 02/08/2017

Volumetric models have become a popular representation for 3D scenes in recent years. One breakthrough leading to their popularity was KinectFusion, which focuses on 3D reconstruction using RGB-D sensors. However, monocular SLAM has since also been tackled with very similar approaches. Representing the reconstruction volumetrically as a TSDF leads to most of the simplicity and efficiency that can be achieved with GPU implementations of these systems. However, this representation is memory-intensive and limits applicability to small-scale reconstructions. Several avenues have been explored to overcome this. With the aim of summarizing them and providing for a fast, flexible 3D reconstruction pipeline, we propose a new, unifying framework called InfiniTAM. The idea is that steps like camera tracking, scene representation and integration of new data can easily be replaced and adapted to the user's needs. This report describes the technical implementation details of InfiniTAM v3, the third version of our InfiniTAM system. We have added various new features, as well as making numerous enhancements to the low-level code that significantly improve our camera tracking performance. The new features that we expect to be of most interest are (i) a robust camera tracking module; (ii) an implementation of Glocker et al.'s keyframe-based random ferns camera relocaliser; (iii) a novel approach to globally-consistent TSDF-based reconstruction, based on dividing the scene into rigid submaps and optimising the relative poses between them; and (iv) an implementation of Keller et al.'s surfel-based reconstruction approach.

Comment: This article largely supersedes arXiv:1410.0925 (it describes version 3 of the InfiniTAM framework
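For readers unfamiliar with the TSDF (truncated signed distance function) representation mentioned in this abstract, the following is a minimal sketch of the usual per-voxel running-average update used in KinectFusion-style fusion. It is not InfiniTAM code: the `Voxel` class, the `integrate` function, the truncation band `mu` and the `max_weight` cap are hypothetical names and values chosen for illustration, and depth is treated as a single scalar along one camera ray.

```python
from dataclasses import dataclass


@dataclass
class Voxel:
    tsdf: float = 1.0    # truncated signed distance, normalised to [-1, 1]
    weight: float = 0.0  # number of observations fused so far (capped)


def integrate(voxel: Voxel, voxel_depth: float, measured_depth: float,
              mu: float, max_weight: float = 100.0) -> None:
    """Fuse one depth measurement into a voxel lying on the same camera ray.

    voxel_depth is the voxel's distance from the camera, measured_depth is
    the sensor reading for that ray, and mu is the truncation band width.
    """
    sdf = measured_depth - voxel_depth  # positive in front of the surface
    if sdf < -mu:
        return  # far behind the observed surface: occluded, leave untouched
    d = min(1.0, sdf / mu)              # truncate and normalise
    new_weight = voxel.weight + 1.0
    voxel.tsdf = (voxel.tsdf * voxel.weight + d) / new_weight
    voxel.weight = min(new_weight, max_weight)
```

Because every voxel in the volume stores such a value and weight, memory grows with the size of the volume, which is the limitation the abstract attributes to this representation.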

arXiv.org e-Print Archive

Oxford University Research Archive

Learning discriminative space-time actions from weakly labelled videos

Author: Michael Sapienza
Fabio Cuzzolin
Philip Torr
Publication venue: British Machine Vision Association
Publication date: 12/07/2002

Crossref

A generative traversability model for monocular robot self-guidance

Author: Camilleri Kenneth P.
Sapienza Michael
Publication venue: 9th International Conference on Informatics in Control, Automation and Robotics (ICINCO 2012) - IEEE Robotics and Automation Society
Publication date: 01/01/2012

The research work disclosed in this publication is partially funded by the Strategic Educational Pathways Scholarship (Malta). The scholarship is part-financed by the European Union - European Social Fund (ESF) under the Operational Programme II - Cohesion Policy 2007-2013, Empowering People for More Jobs and a Better Quality of Life.

In order for robots to be integrated into human active spaces and perform useful tasks, they must be capable of discriminating between traversable surfaces and obstacle regions in their surrounding environment. In this work, a principled semi-supervised (EM) framework is presented for the detection of traversable image regions for use on a low-cost monocular mobile robot. We propose a novel generative model for the occurrence of traversability cues, which are a measure of dissimilarity between safe-window and image superpixel features. Our classification results on both indoor and outdoor image sequences demonstrate its generality and adaptability to multiple environments through the online learning of an exponential mixture model. We show that this appearance-based vision framework is robust and can quickly and accurately estimate the probabilistic traversability of an image using no temporal information. Moreover, the reduction in safe-window size as compared to the state-of-the-art enables a self-guided monocular robot to roam in closer proximity to obstacles.

Peer-reviewed

OAR@UM

Online real-time multiple spatiotemporal action localisation and prediction

Author: Cuzzolin Fabio
Saha Suman
Sapienza Michael
Singh Gurkirt
Torr Philip
Publication date: 01/01/2017

We present a deep-learning framework for real-time multiple spatio-temporal (S/T) action localisation, classification and early prediction. Current state-of-the-art approaches work offline, and are too slow to be useful in real-world settings. To overcome their limitations we introduce two major developments. Firstly, we adopt real-time SSD (Single Shot MultiBox Detector) convolutional neural networks to regress and classify detection boxes in each video frame potentially containing an action of interest. Secondly, we design an original and efficient online algorithm to incrementally construct and label ‘action tubes’ from the SSD frame-level detections. As a result, our system is not only capable of performing S/T detection in real time, but can also perform early action prediction in an online fashion. We achieve new state-of-the-art results in both S/T action localisation and early action prediction on the challenging UCF101-24 and J-HMDB-21 benchmarks, even when compared to the top offline competitors. To the best of our knowledge, ours is the first real-time (up to 40 fps) system able to perform online S/T action localisation and early action prediction on the untrimmed videos of UCF101-24.
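The idea of building action tubes incrementally, one frame at a time, can be sketched as a greedy matching step. This is only an illustration under stated assumptions, not the algorithm from the paper: the function `update_tubes`, the `min_iou` threshold, the ordering of tubes by accumulated score and the choice never to terminate a tube are all hypothetical simplifications.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout
Detection = Tuple[Box, float]            # (box, class score)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def update_tubes(tubes: List[List[Detection]],
                 detections: List[Detection],
                 min_iou: float = 0.3) -> List[List[Detection]]:
    """Extend existing tubes with the detections of one new frame.

    Tubes are visited in order of decreasing accumulated score; each takes
    the highest-scoring unmatched detection that overlaps its last box by
    at least min_iou. Leftover detections start new tubes.
    """
    unmatched = list(range(len(detections)))
    for tube in sorted(tubes, key=lambda t: sum(s for _, s in t), reverse=True):
        last_box = tube[-1][0]
        candidates = [i for i in unmatched
                      if iou(last_box, detections[i][0]) >= min_iou]
        if candidates:
            best = max(candidates, key=lambda i: detections[i][1])
            tube.append(detections[best])
            unmatched.remove(best)
    for i in unmatched:
        tubes.append([detections[i]])
    return tubes
```

Calling `update_tubes` once per incoming frame keeps the per-frame cost proportional to the number of tubes times the number of detections, which is what makes an online formulation attractive compared with an offline pass over the whole video.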

arXiv.org e-Print Archive

Oxford University Research Archive

Oxford Brookes University: RADAR